Case Study - MovieLens

To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset.
The dataset contains 100,000 ratings made by 943 users on 1,682 movies.

The MovieLens data is a good example for this because it has a lot of inter-ralationships that will requiere joining/grouping :

  • the datasets users and ratings are linked together by a key (in this case, the user_id and movie_id).
  • a rating requires both a user and a movie.
  • a user can be associated with zero or many ratings and movies.
  • a movie can be rated zero or many times, by a number of different users.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)

In [3]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols)

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_cols, usecols=range(5))

# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)

Summary

  1. What are the 10 most rated movies ?
  2. Which movies are most controversial amongst different age ?

In [5]:
print movies.head(3), '\n'
print ratings.head(3), '\n'
print users.head(3)


   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   

                                            imdb_url  
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...  
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...  
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...   

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116 

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067 

   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         4  Get Shorty (1995)  01-Jan-1995                 NaN   
2         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            imdb_url  user_id  rating  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...      308       4   
1  http://us.imdb.com/M/title-exact?Get%20Shorty%...      308       5   
2  http://us.imdb.com/M/title-exact?Copycat%20(1995)      308       4   

   unix_timestamp  age sex occupation zip_code  
0       887736532   60   M    retired    95076  
1       887737890   60   M    retired    95076  
2       887739608   60   M    retired    95076  

In [6]:
print lens.head()


   movie_id                  title release_date  video_release_date  \
0         1       Toy Story (1995)  01-Jan-1995                 NaN   
1         4      Get Shorty (1995)  01-Jan-1995                 NaN   
2         5         Copycat (1995)  01-Jan-1995                 NaN   
3         7  Twelve Monkeys (1995)  01-Jan-1995                 NaN   
4         8            Babe (1995)  01-Jan-1995                 NaN   

                                            imdb_url  user_id  rating  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...      308       4   
1  http://us.imdb.com/M/title-exact?Get%20Shorty%...      308       5   
2  http://us.imdb.com/M/title-exact?Copycat%20(1995)      308       4   
3  http://us.imdb.com/M/title-exact?Twelve%20Monk...      308       4   
4     http://us.imdb.com/M/title-exact?Babe%20(1995)      308       5   

   unix_timestamp  age sex occupation zip_code  
0       887736532   60   M    retired    95076  
1       887737890   60   M    retired    95076  
2       887739608   60   M    retired    95076  
3       887738847   60   M    retired    95076  
4       887736696   60   M    retired    95076  

In [9]:
lens.describe()


Out[9]:
movie_id video_release_date user_id rating unix_timestamp age
count 100000.000000 0 100000.00000 100000.000000 1.000000e+05 100000.000000
mean 425.530130 NaN 462.48475 3.529860 8.835289e+08 32.969850
std 330.798356 NaN 266.61442 1.125674 5.343856e+06 11.562623
min 1.000000 NaN 1.00000 1.000000 8.747247e+08 7.000000
25% 175.000000 NaN 254.00000 3.000000 8.794487e+08 24.000000
50% 322.000000 NaN 447.00000 4.000000 8.828269e+08 30.000000
75% 631.000000 NaN 682.00000 4.000000 8.882600e+08 40.000000
max 1682.000000 NaN 943.00000 5.000000 8.932866e+08 73.000000

What are the 10 most rated movies ?

  1. Split the DataFrame into groups by movie title
  2. Apply the size method to get the count of records in each group
  3. Order our results in descending order
  4. Limit the output to the top 10 using Python's slicing syntax.

using .groupby()


In [5]:
most_rated = lens.groupby('title').size().order(ascending=False)[:10]
print most_rated


title
Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64

using .value_count()


In [7]:
# using .value_count()
lens.title.value_counts()[:10]


Out[7]:
Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64

Which movies are most highly rated ?

  1. Split the DataFrame into groups by movie title
  2. Apply the size and mean to each group using .agg()method
  3. Order the results by average rating in descending order
  4. Keep only movies with at least 100 ratings

We can use the .agg() method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.


In [25]:
movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
print movie_stats.head()


                          rating          
                            size      mean
title                                     
'Til There Was You (1997)      9  2.333333
1-900 (1994)                   5  2.600000
101 Dalmatians (1996)        109  2.908257
12 Angry Men (1957)          125  4.344000
187 (1997)                    41  3.024390

Let's sort the resulting DataFrame so that we can see which movies have the highest average score.
Because movie_stats is a DataFrame, we use the sort method (only Series objects use order).
Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort.


In [26]:
print movie_stats.sort([('rating', 'mean')], ascending=False).head()


                                           rating     
                                             size mean
title                                                 
They Made Me a Criminal (1939)                  1    5
Marlene Dietrich: Shadow and Light (1996)       1    5
Saint of Fort Washington, The (1993)            2    5
Someone Else's America (1995)                   1    5
Star Kid (1997)                                 3    5

The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.


In [55]:
atleast_100 = movie_stats['rating']['size'] > 100
print movie_stats[atleast_100].sort([('rating', 'mean')], ascending=False).head()


                                 rating          
                                   size      mean
title                                            
Close Shave, A (1995)               112  4.491071
Schindler's List (1993)             298  4.466443
Wrong Trousers, The (1993)          118  4.466102
Casablanca (1942)                   243  4.456790
Shawshank Redemption, The (1994)    283  4.445230

Which movies are most controversial amongst different age ?

  1. Bin users into age groups using pandas.cut()
  2. Split the DataFrame into groups by age groups
  3. Apply the size and mean to each group using .agg()method
  4. Order the results by average rating in descending order

Limiting our population going forward
Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.


In [8]:
most_50 = lens.groupby('movie_id').size().order(ascending=False)[:50]

Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.


In [9]:
users.age.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');

In [10]:
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bins = range(0, 81, 10) # [0, 10, 20, 30, 40, 50, 60, 70, 80]
lens['age_group'] = pd.cut(lens.age, bins, right=False, labels=labels)
print lens[['age', 'age_group']].drop_duplicates()[:5] # preview of age bin


     age age_group
0     60     60-69
397   21     20-29
459   33     30-39
524   30     30-39
782   23     20-29

Now we can now compare ratings across age groups.


In [62]:
print lens.groupby('age_group').agg({'rating': [np.size, np.mean]})


          rating          
            size      mean
age_group                 
0-9           43  3.767442
10-19       8181  3.486126
20-29      39535  3.467333
30-39      25696  3.554444
40-49      15021  3.591772
50-59       8704  3.635800
60-69       2623  3.648875
70-79        197  3.649746

Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50 Series we created earlier for filtering.


In [63]:
lens.set_index('movie_id', inplace=True)

In [64]:
by_age = lens.ix[most_50.index].groupby(['title', 'age_group'])
by_age.rating.mean().head(15)


Out[64]:
title                 age_group
Air Force One (1997)  10-19        3.647059
                      20-29        3.666667
                      30-39        3.570000
                      40-49        3.555556
                      50-59        3.750000
                      60-69        3.666667
                      70-79        3.666667
Alien (1979)          10-19        4.111111
                      20-29        4.026087
                      30-39        4.103448
                      40-49        3.833333
                      50-59        4.272727
                      60-69        3.500000
                      70-79        4.000000
Aliens (1986)         10-19        4.050000
Name: rating, dtype: float64

Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.

Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.

Behold! The magic of unstack!


In [66]:
by_age.rating.mean().unstack(1)[10:20]


Out[66]:
age_group 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79
title
E.T. the Extra-Terrestrial (1982) NaN 3.680000 3.609091 3.806818 4.160000 4.368421 4.375000 NaN
Empire Strikes Back, The (1980) 4 4.642857 4.311688 4.052083 4.100000 3.909091 4.250000 5.000000
English Patient, The (1996) 5 3.739130 3.571429 3.621849 3.634615 3.774648 3.904762 4.500000
Fargo (1996) NaN 3.937500 4.010471 4.230769 4.294118 4.442308 4.000000 4.333333
Forrest Gump (1994) 5 4.047619 3.785714 3.861702 3.847826 4.000000 3.800000 NaN
Fugitive, The (1993) NaN 4.320000 3.969925 3.981481 4.190476 4.240000 3.666667 NaN
Full Monty, The (1997) NaN 3.421053 4.056818 3.933333 3.714286 4.146341 4.166667 3.500000
Godfather, The (1972) NaN 4.400000 4.345070 4.412844 3.929412 4.463415 4.125000 NaN
Groundhog Day (1993) NaN 3.476190 3.798246 3.786667 3.851064 3.571429 3.571429 4.000000
Independence Day (ID4) (1996) NaN 3.595238 3.291429 3.389381 3.718750 3.888889 2.750000 NaN

unstack, well, unstacks the specified level of a MultiIndex (by default, groupby turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0.

If we would have used:

by_age.rating.mean().unstack(0).fillna(0)

We would have had our age groups as rows and movie titles as columns.

Which movies do men and women most disagree on?

DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).


In [67]:
lens.reset_index('movie_id', inplace=True)

In [78]:
pivoted = pd.pivot_table(lens, values='rating', index=['movie_id', 'title'],
...                     columns=['sex'], fill_value=0)
print pivoted.head()


sex                                F         M
movie_id title                                
1        Toy Story (1995)   3.789916  3.909910
2        GoldenEye (1995)   3.368421  3.178571
3        Four Rooms (1995)  2.687500  3.108108
4        Get Shorty (1995)  3.400000  3.591463
5        Copycat (1995)     3.772727  3.140625

In [79]:
pivoted['diff'] = pivoted.M - pivoted.F
print pivoted.head()


sex                                F         M      diff
movie_id title                                          
1        Toy Story (1995)   3.789916  3.909910  0.119994
2        GoldenEye (1995)   3.368421  3.178571 -0.189850
3        Four Rooms (1995)  2.687500  3.108108  0.420608
4        Get Shorty (1995)  3.400000  3.591463  0.191463
5        Copycat (1995)     3.772727  3.140625 -0.632102

In [80]:
pivoted.reset_index('movie_id', inplace=True)

In [81]:
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.order().plot(kind='barh', figsize=[9, 15])
plt.title('Male vs. Female Avg. Ratings\n(Difference > 0 = Favored by Men)')
plt.ylabel('Title')
plt.xlabel('Average Rating Difference');